Website Markdown Crawler

Pricing

from $2.00 / 1,000 websites analyzed

Crawls a website and converts every page to clean Markdown optimized for LLM ingestion.

Developer

Ziad Tarik

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 6 hours ago

Crawls a website starting from a seed URL and converts every page to clean Markdown optimized for LLM ingestion (LlamaIndex, LangChain, OpenAI, Pinecone). The output for each page includes structured metadata: title, detected language, publication date, headings outline, word count, and chunked content ready for vector store upsert.

Features

  • Clean Markdown Extraction: Strips noise (navigation, footers) to extract just the main content.
  • Smart Chunking: Splits content into chunks sized by token count while respecting paragraph boundaries.
  • Language Filtering: Detects each page's language and can keep only selected languages (e.g., only en or fr).
  • Domain Control: Keeps the crawler scoped to the seed URL's domain.
  • Regex Exclusions: Skips low-value URLs, such as tag or author pages, that match regex patterns you supply.
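
To make the domain scoping, regex exclusions, and paragraph-aware chunking above more concrete, here is a minimal Python sketch. It is not the Actor's actual implementation: the function names, the exact-host comparison, and the "about 4 characters per token" estimate are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse


def in_scope(url: str, seed_url: str, exclude_patterns: list[str]) -> bool:
    """Return True if url is on the seed's host and matches no exclusion regex."""
    # Assumption: "same domain" means an exact host match with the seed URL.
    if urlparse(url).netloc.lower() != urlparse(seed_url).netloc.lower():
        return False
    return not any(re.search(pattern, url) for pattern in exclude_patterns)


def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def chunk_by_paragraph(markdown: str, max_tokens: int) -> list[dict]:
    """Greedily pack whole paragraphs into chunks of at most max_tokens.

    A single paragraph larger than max_tokens becomes its own oversized
    chunk rather than being split mid-paragraph.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", markdown) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and estimate_tokens(candidate) > max_tokens:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append("\n\n".join(current))
            current = [para]
        else:
            current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return [
        {"index": i, "content": c, "tokenEstimate": estimate_tokens(c)}
        for i, c in enumerate(chunks)
    ]
```

The chunker returns dictionaries with the same index, content, and tokenEstimate keys that appear in the Actor's chunk records, so the two are easy to compare side by side. A real tokenizer would give more accurate counts than the character heuristic used here.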

Output Example

Each crawled page yields a structured JSON record:

{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started — Example Docs",
  "description": "Learn how to set up Example in 5 minutes.",
  "language": "en",
  "wordCount": 842,
  "tokenEstimate": 1120,
  "headings": [
    { "level": 1, "text": "Getting Started" },
    { "level": 2, "text": "Installation" }
  ],
  "markdown": "# Getting Started\n\nLearn how to...",
  "chunks": [
    { "index": 0, "content": "# Getting Started\n\nLearn how to...", "tokenEstimate": 498 }
  ],
  "chunkCount": 1,
  "depth": 1,
  "crawledAt": "2026-06-10T14:32:00.000Z"
}
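
As a rough illustration of how a record like this might be prepared for a vector store, the sketch below flattens one page record into one row per chunk. The row layout (id, text, metadata), the function name to_upsert_rows, and the hashed ID scheme are assumptions for illustration; they are not part of the Actor's output or of any particular vector store's API.

```python
import hashlib


def to_upsert_rows(record: dict) -> list[dict]:
    """Turn one page record into one row per chunk (hypothetical row layout)."""
    rows = []
    for chunk in record.get("chunks", []):
        # Assumption: a deterministic ID built from URL and chunk index, so
        # re-crawling the same page overwrites rows instead of duplicating them.
        raw_id = f"{record['url']}#{chunk['index']}"
        rows.append({
            "id": hashlib.sha256(raw_id.encode("utf-8")).hexdigest()[:16],
            "text": chunk["content"],
            "metadata": {
                "url": record["url"],
                "title": record.get("title"),
                "language": record.get("language"),
                "chunk": chunk["index"],
                "crawledAt": record.get("crawledAt"),
            },
        })
    return rows


# Abbreviated version of the example record above.
example_record = {
    "url": "https://docs.example.com/getting-started",
    "language": "en",
    "chunks": [
        {"index": 0, "content": "# Getting Started\n\nLearn how to...", "tokenEstimate": 498}
    ],
    "crawledAt": "2026-06-10T14:32:00.000Z",
}

rows = to_upsert_rows(example_record)
```

Each row still has to be embedded before it can be upserted; the embedding call and the store's client are left out because they depend on the stack you use.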

Integrations

Connect the crawler directly into your RAG stack.

LlamaIndex

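import json

from llama_index.core import Document

# After running the Actor, download the dataset as JSON.
# "dataset_items.json" is a placeholder filename; use the path of your export.
with open("dataset_items.json", encoding="utf-8") as f:
    dataset_items = json.load(f)

# One Document per chunk, tagged with its source URL and chunk index.
docs = [
    Document(text=chunk['content'], metadata={'url': item['url'], 'chunk': chunk['index']})
    for item in dataset_items
    for chunk in item['chunks']
]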

LangChain

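from langchain_core.documents import Document as LCDoc

# dataset_items is the list of page records from the Actor's dataset export.
# One LangChain document per chunk, with the page URL as its source.
lc_docs = [
    LCDoc(page_content=chunk['content'], metadata={'source': item['url']})
    for item in dataset_items
    for chunk in item['chunks']
]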